Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

[

] and to investigate how gastrectomy impacts on gastric cancer

based on profiling faecal microbiome and metabolome

ntari, et al., 2020].

The definition and working principle of LDA

The $n$-th observation is represented by a vector $\mathbf{x}_n$ and the label of the observation is denoted by $y_n$. In terms of protease cleavage pattern discovery, $\mathbf{x}_n$ is a peptide, which is labelled by $y_n$ as either cleaved or non-cleaved. The classification label for this type of data is binary. Normally a non-cleaved peptide $\mathbf{x}_n$ is labelled by a zero, i.e., $y_n = 0$, and a cleaved peptide $\mathbf{x}_n$ is labelled by a one, i.e., $y_n = 1$. A general format of a classification model is shown below,

$$\hat{y}_n = f(\mathbf{x}_n, \mathbf{w}) \tag{3.1}$$

where $\mathbf{w}$ is a vector of model parameters, $f$ is a classification function, and $\hat{y}_n$ is the prediction corresponding to $y_n$. In a well-constructed classifier, $\hat{y}_n$ should be a numerical value close to zero if $y_n = 0$ and $\hat{y}_n$ should be a numerical value close to one if $y_n = 1$. If a classification problem is linear, the classification model can be formulated as below,

$$\hat{y}_n = \mathbf{w}^{t}\mathbf{x}_n = x_{n1} w_1 + x_{n2} w_2 + \cdots + x_{nd} w_d \mapsto y_n \tag{3.2}$$

where $x_{ni}$ corresponds to the $i$-th independent variable of vector $\mathbf{x}_n$ and $w_i$ stands for the $i$-th weight in $\mathbf{w}$, which is used to weigh the contribution of $x_{ni}$. A vector-matrix format of a linear classifier is formulated as below, where $\mathbf{X}$ is an input matrix and $\hat{\mathbf{y}}$ is an output vector,

$$\hat{\mathbf{y}} = \mathbf{X}^{t}\mathbf{w} \tag{3.3}$$
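To make the two formulations concrete, the following is a minimal R sketch, not taken from the book, that evaluates equation (3.2) one observation at a time and equation (3.3) for all observations at once. It assumes that the observations are stored as the columns of `X` (so that `t(X) %*% w` is well defined), and the data, the labels `y` and the weights `w` are made-up illustrative values rather than a fitted LDA solution.

```r
# Toy data: d = 2 variables, N = 4 observations stored as COLUMNS of X
# (an assumption made so that t(X) %*% w matches equation (3.3)).
# All numbers are hypothetical and chosen only for illustration.
X <- matrix(c(0.1, 0.2,
              0.3, 0.1,
              0.9, 0.8,
              0.7, 1.0),
            nrow = 2)             # column n holds observation x_n
y <- c(0, 0, 1, 1)                # binary labels: 0 = non-cleaved, 1 = cleaved
w <- c(0.5, 0.5)                  # hypothetical weights, not estimated from data

# Equation (3.2): y_hat_n = x_n1 * w_1 + ... + x_nd * w_d, one observation at a time
y_hat_loop <- sapply(seq_len(ncol(X)), function(n) sum(X[, n] * w))

# Equation (3.3): y_hat = t(X) %*% w, all observations at once
y_hat_mat <- as.vector(t(X) %*% w)

# The two formulations give the same predictions
stopifnot(isTRUE(all.equal(y_hat_loop, y_hat_mat)))
print(cbind(y = y, y_hat = y_hat_mat))
```

With these illustrative weights, the predictions for the two observations labelled zero fall near zero and those for the two observations labelled one fall near one, which is the behaviour expected of a well-constructed classifier.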

The major part of LDA is to find the best projection direction to map a high-dimensional genotype space ($\mathbf{X}$) to a one-dimensional phenotype space ($\hat{\mathbf{y}}$). To make an LDA model work, the density of $\hat{\mathbf{y}}$ is required to be bimodal. Only when this bimodality is maximised should the projection direction or the model parameters $\mathbf{w}$ be considered as an optimal solution